Abstract

In multiple-input multiple-output (MIMO) systems, sphere decoding (SD) achieves performance equivalent to 
full search maximum likelihood decoding with reduced complexity. Several researchers have reported techniques that
reduce the complexity of SD further. In this paper, a new technique is introduced which reduces the computational
complexity of SD substantially, without sacrificing performance. The reduction is accomplished by deconstructing 
the decoding metric to reduce the number of computations to their minimum and exploiting the structure of a 
lattice representation. Simulation results show that this approach achieves substantial gains for the average number 
of real multiplications and real additions needed to decode one transmitted vector symbol. As an example, for a 
4x4 MIMO system, the gains in the number of multiplications are 85% with 4-QAM and 90% with 64-QAM, 
at low SNR. These complexity gains become larger when the system dimension or the modulation alphabet size 
increases. 




I. INTRODUCTION 

Multiple-input multiple-output (MIMO) systems have drawn substantial research and development interest
because they offer high spectral efficiency and performance in a given bandwidth. In such systems,
the eventual goal is to minimize the bit error rate (BER) for a given signal-to-noise ratio (SNR). A
number of different MIMO systems exist. Most of these systems result in optimum decoding techniques
that are complicated. Therefore, a number of decoding algorithms with different complexity-performance
trade-offs have been introduced. Linear detection methods such as zero-forcing (ZF) or minimum mean
squared error (MMSE) provide linear complexity, but their performance is suboptimal. Ordered
successive interference cancellation decoders, such as the vertical Bell Laboratories layered space-time
(V-BLAST) algorithm, show slightly better performance compared to ZF and MMSE, but suffer from error
propagation and are still suboptimal [1]. It is well known that maximum likelihood (ML) detection is the
optimum method. However, the complexity of the ML algorithm in MIMO systems increases exponentially
with the number of possible constellation points for the modulation scheme, making the algorithm
unsuitable for practical purposes [2]. Sphere decoding (SD), on the other hand, is proposed as an alternative
to ML that provides optimal performance with reduced complexity [3].

Although its complexity is much smaller than that of ML decoding, there is room for complexity reduction
in conventional SD. To that end, several complexity reduction techniques for SD have been proposed.
In [4] and [5], attention is drawn to the initial radius selection strategy, since an inappropriate initial radius
can result in either a large number of lattice points to be searched or a large number of restart actions.
In [6], the complexity is attacked by making a proper choice to update the sphere radius. In [7], the
Schnorr-Euchner (SE) strategy is applied for SD, which executes intelligent enumeration of candidate
symbols at each level to reduce the number of visited nodes when the system dimension is small.
Channel reordering techniques can also be applied to reduce the number of visited nodes [8], [9], [10].
Other methods, such as the K-best lattice decoder [11], [12], can significantly reduce the complexity at
low SNR, but at the cost of BER performance degradation.

In this paper, the complexity of SD is reduced by lowering the number of operations required
at each node to obtain the ML solution for flat fading channels. This complexity reduction is achieved by
deconstructing the decoding metric in order to reduce the computations to their minimum and exploiting
the structure of a lattice representation of SD [9], [10]. In simulations, 2x2 and 4x4 MIMO systems with
4-QAM and 64-QAM have been studied. In these systems, the reduction in the number of real additions
is in the range of 40% to 75%, and the reduction in the number of real multiplications is in the range of
70% to 90%, without any change in performance. The complexity gains increase with the MIMO system
dimension or the modulation alphabet size.

The remainder of this paper is organized as follows: In Section II, the problem definition is introduced
and a brief review of the conventional SD algorithm is presented. In Section III, a new technique to implement
the SD algorithm with low computational complexity is proposed, and the mathematical derivations for
the complexity reduction are carried out. In Section IV, complexity comparisons for different numbers of
antennas and modulation schemes are provided. Finally, a conclusion is provided in Section V.

II. CONVENTIONAL SPHERE DECODER 

MIMO systems using square quadrature amplitude modulation (QAM) with $N_t$ transmit and $N_r$ receive
antennas are considered, and the channel is assumed to be flat fading. Then, the input-output relation is
given by

$$\mathbf{r} = \mathbf{H}\mathbf{s} + \mathbf{v}, \tag{1}$$

where $\mathbf{r} \in \mathbb{C}^{N_r}$ is the $N_r$-dimensional received vector symbol and $\mathbb{C}$ denotes the set of complex numbers;
$\mathbf{H} \in \mathbb{C}^{N_r \times N_t}$ is the channel matrix whose channel coefficients are independent and identically distributed
(i.i.d.) zero-mean, unit-variance complex Gaussian random variables; $\mathbf{s} \in \mathbb{C}^{N_t}$ is an $N_t$-dimensional
transmitted complex vector with each element in square QAM format; and $\mathbf{v} \in \mathbb{C}^{N_r}$ is a zero-mean white
Gaussian noise vector with covariance matrix $\sigma^2\mathbf{I}$.

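Assuming $\mathbf{H}$ is known at the receiver, ML detection is

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \chi^{N_t}} \|\mathbf{r} - \mathbf{H}\mathbf{s}\|^2, \tag{2}$$

where $\chi$ denotes the sample space for QAM modulation scalar symbols. For example, $\chi = \{-3, -1, 1, 3\}^2$
for 16-QAM.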
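
To make the cost of the exhaustive search in (2) concrete, the following is a minimal Python sketch of a brute-force ML detector. The function name `ml_detect`, the NumPy dependency, and the toy 2x2 noiseless channel are assumptions made for illustration and are not part of the original work.

```python
import itertools
import numpy as np

def ml_detect(r, H, alphabet):
    """Exhaustive ML search: test every candidate vector s and keep the closest H s."""
    n_t = H.shape[1]
    best_s, best_metric = None, np.inf
    # The loop visits len(alphabet) ** n_t candidates, which is the exponential cost.
    for cand in itertools.product(alphabet, repeat=n_t):
        s = np.array(cand)
        metric = np.linalg.norm(r - H @ s) ** 2
        if metric < best_metric:
            best_s, best_metric = s, metric
    return best_s, best_metric

# 4-QAM scalar alphabet: real and imaginary parts each drawn from {-1, 1}
qam4 = [complex(a, b) for a in (-1, 1) for b in (-1, 1)]

rng = np.random.default_rng(0)
H = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)
s_true = np.array([qam4[0], qam4[3]])
r = H @ s_true                     # noiseless toy case, so the ML estimate equals s_true
s_hat, metric = ml_detect(r, H, qam4)
```

For 64-QAM over a 4x4 system the same loop would visit $64^4$ candidates, which is the motivation for restricting the search to a sphere.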

Solving (2) is known to be NP-hard, given that a full search over the entire lattice space is performed
[13]. SD, on the other hand, solves (2) by searching only lattice points that lie inside a sphere of radius
$\delta$ centered around the received vector $\mathbf{r}$.

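A frequently used solution for the QAM-modulated signal model is to decompose the $N_r$-dimensional
complex-valued problem (1) into a $2N_r$-dimensional real-valued problem, which can be written as

$$\begin{bmatrix} \Re\{\mathbf{r}\} \\ \Im\{\mathbf{r}\} \end{bmatrix} = \begin{bmatrix} \Re\{\mathbf{H}\} & -\Im\{\mathbf{H}\} \\ \Im\{\mathbf{H}\} & \Re\{\mathbf{H}\} \end{bmatrix} \begin{bmatrix} \Re\{\mathbf{s}\} \\ \Im\{\mathbf{s}\} \end{bmatrix} + \begin{bmatrix} \Re\{\mathbf{v}\} \\ \Im\{\mathbf{v}\} \end{bmatrix}, \tag{3}$$

where $\Re\{\mathbf{r}\}$ and $\Im\{\mathbf{r}\}$ denote the real and imaginary parts of $\mathbf{r}$ respectively [8], [13]. Let

$$\mathbf{y} = \begin{bmatrix} \Re\{\mathbf{r}\}^T & \Im\{\mathbf{r}\}^T \end{bmatrix}^T, \tag{4}$$

$$\mathbf{H} = \begin{bmatrix} \Re\{\mathbf{H}\} & -\Im\{\mathbf{H}\} \\ \Im\{\mathbf{H}\} & \Re\{\mathbf{H}\} \end{bmatrix}, \tag{5}$$

$$\mathbf{x} = \begin{bmatrix} \Re\{\mathbf{s}\}^T & \Im\{\mathbf{s}\}^T \end{bmatrix}^T, \tag{6}$$

$$\mathbf{n} = \begin{bmatrix} \Re\{\mathbf{v}\}^T & \Im\{\mathbf{v}\}^T \end{bmatrix}^T, \tag{7}$$

where, with a slight abuse of notation, $\mathbf{H}$ on the left-hand side of (5) and in the sequel denotes the real-valued
channel matrix; then (3) can be written as

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}. \tag{8}$$

Assuming $N_t = N_r = N$ in the sequel, and using the QR decomposition $\mathbf{H} = \mathbf{Q}\mathbf{R}$, where $\mathbf{R}$ is an
upper triangular matrix and the matrix $\mathbf{Q}$ is unitary, SD solves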



$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x} \in \Upsilon} \|\bar{\mathbf{y}} - \mathbf{R}\mathbf{x}\|^2, \tag{9}$$

where $\bar{\mathbf{y}} = \mathbf{Q}^H\mathbf{y}$. Let $\Omega$ denote the sample space for one dimension of QAM-modulated symbols, e.g.,
$\Omega = \{-3, -1, 1, 3\}$ for 16-QAM; then $\Upsilon$ denotes the subset of $\Omega^{2N}$ whose elements satisfy $\|\bar{\mathbf{y}} - \mathbf{R}\mathbf{x}\|^2 < \delta^2$.
The SD algorithm can be viewed as a pruning algorithm on a tree of depth $2N$, whose branches
correspond to elements drawn from the set $\Omega$ [9], [10], [13]. Conventional SD implements a depth-first
search (DFS) strategy in the tree, which can achieve ML performance.
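
The passage from the complex-valued model to the triangular real-valued metric can be checked numerically. The sketch below assumes the usual stacking of real parts over imaginary parts for the real-valued problem; the helper name `to_real` and the toy data are hypothetical. It verifies that the complex metric and the metric after QR decomposition agree for a given candidate.

```python
import numpy as np

def to_real(H, r):
    """Stack real and imaginary parts: complex N-dim model -> real 2N-dim model."""
    H_real = np.block([[H.real, -H.imag],
                       [H.imag,  H.real]])
    y = np.concatenate([r.real, r.imag])
    return H_real, y

rng = np.random.default_rng(1)
N = 2
H = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2)
s = np.array([1 - 1j, -1 + 1j])               # transmitted 4-QAM symbols
r = H @ s                                     # noiseless for simplicity

H_real, y = to_real(H, r)
x = np.concatenate([s.real, s.imag])
Q, R = np.linalg.qr(H_real)                   # R is upper triangular
y_bar = Q.T @ y                               # Q is real, so Q^H equals Q^T

# Metric of some other candidate, evaluated in both domains
s_other = np.array([1 + 1j, -1 - 1j])
x_other = np.concatenate([s_other.real, s_other.imag])
metric_complex = np.linalg.norm(r - H @ s_other) ** 2
metric_real = np.linalg.norm(y_bar - R @ x_other) ** 2
```

Because $\mathbf{Q}$ is square and orthogonal here, the rotation leaves the Euclidean norm unchanged, which is why the two metrics coincide.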

Conventional SD starts the search process from the root of the tree, and then searches down along
branches until the total weight of a node exceeds the square of the sphere radius, $\delta^2$. At this point, the
corresponding branch is pruned, and any path passing through that node is declared improbable as
a candidate solution. Then the algorithm backtracks and proceeds down a different branch. Whenever a
valid lattice point at the bottom level of the tree is found within the sphere, $\delta^2$ is set to the newly found
point weight, thus reducing the search space for finding other candidate solutions. In the end, the path
from the root to the leaf that is inside the sphere with the lowest weight is chosen to be the estimated
solution $\hat{\mathbf{x}}$. If no candidate solutions can be found, the tree is searched again with a larger initial
radius.
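
The search procedure described above can be sketched in a few lines of Python. This is an illustrative depth-first decoder under stated assumptions, not the authors' implementation: the function names, the 0-based layer indexing, and in particular the radius growth factor of 2 on restart are choices made here for the example.

```python
import numpy as np

def sphere_decode(y_bar, R, omega, radius_sq):
    """Depth-first sphere decoder for the real-valued model y_bar = R x + noise."""
    dim = R.shape[0]
    best = {"x": None, "w": radius_sq}
    x = np.zeros(dim)

    def search(layer, weight):
        # layer runs from dim - 1 (just below the root) down to 0 (leaves)
        for symbol in omega:
            x[layer] = symbol
            resid = y_bar[layer] - R[layer, layer:] @ x[layer:]
            w = weight + resid ** 2
            if w >= best["w"]:
                continue                              # prune: outside the current sphere
            if layer == 0:
                best["x"], best["w"] = x.copy(), w    # shrink the sphere to this leaf
            else:
                search(layer - 1, w)

    search(dim - 1, 0.0)
    return best["x"], best["w"]

def decode_with_restart(y_bar, R, omega, radius_sq):
    """Search again with a larger radius whenever the sphere turns out to be empty."""
    while True:
        x_hat, w = sphere_decode(y_bar, R, omega, radius_sq)
        if x_hat is not None:
            return x_hat, w
        radius_sq *= 2.0      # growth factor is an assumption made for this sketch

rng = np.random.default_rng(2)
Q, R = np.linalg.qr(rng.standard_normal((4, 4)))
omega = (-1.0, 1.0)
x_true = np.array([1.0, -1.0, -1.0, 1.0])
y_bar = R @ x_true + 0.1 * rng.standard_normal(4)
x_hat, w = decode_with_restart(y_bar, R, omega, radius_sq=0.01)
```

Starting from a deliberately small radius exercises the restart path; the result is the same lattice point an exhaustive search would return.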

III. PROPOSED SPHERE DECODER 

The complexity of SD is measured by the number of operations required per visited node multiplied
by the number of visited nodes throughout the search procedure [13]. The complexity can be reduced
by reducing the number of visited nodes, the number of operations to be carried out at each
node, or both. Making a judicious choice of the initial radius to start the algorithm with [4], [5], executing a
proper sphere radius update strategy [6], applying an improved search strategy [7], and exploiting channel
reordering [8], [9], [10] can all reduce the number of visited nodes. In this paper, the focus is on reducing
the average number of operations required at each node for SD.

The node weight is given by [9], [10]

$$w(\mathbf{x}^{(l)}) = w(\mathbf{x}^{(l+1)}) + w_{pw}(\mathbf{x}^{(l)}), \tag{10}$$

with $l = 2N, \ldots, 1$, $w(\mathbf{x}^{(2N+1)}) = 0$, and $w_{pw}(\mathbf{x}^{(2N+1)}) = 0$, where $\mathbf{x}^{(l)}$ denotes the partial vector symbol
at layer $l$. The partial weight corresponding to $\mathbf{x}^{(l)}$ is written as

$$w_{pw}(\mathbf{x}^{(l)}) = \Big| \bar{y}_l - \sum_{k=l}^{2N} R_{l,k} x_k \Big|^2, \tag{11}$$

where $R_{i,j}$ denotes the $(i,j)$th element of $\mathbf{R}$, and $x_i$ denotes the $i$th element of $\mathbf{x}$.

A. Check-Table T

Note that for one channel realization, both $\mathbf{R}$ and $\Omega$ are independent of time. In other words, to decode
different received symbols for one channel realization, the only term in (11) which depends on time is
$\bar{y}_l$. Consequently, a check-table $T$ is constructed to store all terms of the form $R_{i,j}x$, where $R_{i,j} \neq 0$ and $x \in \Omega$,
before starting the tree search procedure. Equations (10) and (11) imply that, by using $T$, only one real multiplication
(the squaring in (11)) is needed instead of $2N - l + 2$ for each node to calculate the node weight. As a result, the
number of real multiplications can be significantly reduced.

Taking the square QAM lattice structure into consideration, $\Omega$ can be divided into two smaller sets: $\Omega_1$
with negative elements and $\Omega_2$ with positive elements. Take 16-QAM for example: $\Omega = \{-3, -1, 1, 3\}$,
then $\Omega_1 = \{-3, -1\}$ and $\Omega_2 = \{1, 3\}$. Any negative element in $\Omega_1$ has a positive element with the same
absolute value in $\Omega_2$. Consequently, in order to build $T$, only terms of the form $R_{i,j}x$, where $R_{i,j} \neq 0$
and $x \in \Omega_1$, need to be calculated and stored. Hence, the size of $T$ is

$$|T| = \frac{N_R |\Omega|}{2}, \tag{12}$$

where $N_R$ denotes the number of non-zero elements in matrix $\mathbf{R}$, and $|\Omega|$ denotes the size of $\Omega$.

In order to build T, both the number of terms that need to be stored and the number of real multiplications 
required are |T|. Since the channel is assumed to be flat fading, only one T needs to be built in one burst. 
If the burst length is very long, its computational complexity can be neglected. 
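
As an illustration of how such a check-table might be organized, the following Python sketch stores the products for the negative half of the alphabet only and recovers the positive half by a sign flip. The dictionary layout and the names `build_check_table` and `lookup` are assumptions for this example; the paper does not prescribe a data structure.

```python
import numpy as np

def build_check_table(R, omega):
    """Store R[i, j] * x for every non-zero R[i, j] and every negative x in omega."""
    negatives = [x for x in omega if x < 0]                # the set Omega_1
    table = {}
    for i, j in zip(*np.nonzero(R)):
        for x in negatives:
            table[(int(i), int(j), x)] = R[i, j] * x       # one real multiplication each
    return table

def lookup(table, i, j, x):
    """Return R[i, j] * x from the table; a positive x reuses the entry for -x, negated."""
    return table[(i, j, x)] if x < 0 else -table[(i, j, -x)]

omega = (-3, -1, 1, 3)                                     # one dimension of 16-QAM
R = np.triu(np.arange(1.0, 17.0).reshape(4, 4))            # toy upper triangular matrix
T = build_check_table(R, omega)
```

In this toy case $\mathbf{R}$ has 10 non-zero entries and $|\Omega| = 4$, so the table holds $10 \cdot 4 / 2 = 20$ products, in agreement with the size of $T$ stated above. Zero entries of $\mathbf{R}$ are not stored, so a caller would skip them rather than look them up.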

B. Intermediate Node Weights

Define

$$M(\mathbf{x}^{(l+1)}) = \bar{y}_l - \sum_{k=l+1}^{2N} R_{l,k} x_k, \tag{13}$$

where the sum is empty for $l = 2N$, so that $M(\mathbf{x}^{(2N+1)}) = \bar{y}_{2N}$; then (11) can be rewritten as

$$w_{pw}(\mathbf{x}^{(l)}) = \big| M(\mathbf{x}^{(l+1)}) - R_{l,l} x_l \big|^2. \tag{14}$$

Equation (13) shows that $M(\mathbf{x}^{(l+1)})$ is independent of $x_l$, which means that for any node not in the last
level of the search tree, all children nodes share the same $M(\mathbf{x}^{(l+1)})$. In other words, for these nodes,
the $M(\mathbf{x}^{(l+1)})$ value only needs to be calculated once to get the whole set of weights for their children
nodes. Consequently, the number of operations is reduced if the $M(\mathbf{x}^{(l+1)})$ values are temporarily stored
at each node, except nodes of the last level, until the whole set of its children has been visited. Based on (10),
(13), and (14), by temporarily storing the $M(\mathbf{x}^{(l+1)})$ values, the number of real additions needed to get
all partial weights of children nodes at layer $l$, for a parent node of layer $l + 1$, reduces to $2N - l + |\Omega|$
from $(2N - l + 1)|\Omega|$. Note that after implementing the check-table $T$, storing the $M(\mathbf{x}^{(l+1)})$ values does not
affect the number of real multiplications.
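
The saving in additions can be seen by comparing two ways of computing the children's partial weights. The sketch below is illustrative only: the function names are hypothetical, layers are 0-based (so index `layer` corresponds to $l$ = `layer` + 1), and the products are written as ordinary multiplications, whereas in the proposed scheme they would be read from the check-table.

```python
import numpy as np

def child_weights_naive(y_bar, R, x, layer, omega):
    """Partial weights of all children at `layer`, redoing the full sum for each child."""
    weights = []
    for symbol in omega:
        acc = y_bar[layer] - R[layer, layer] * symbol
        for k in range(layer + 1, len(x)):
            acc -= R[layer, k] * x[k]
        weights.append(acc ** 2)              # (2N - l + 1) subtractions per child
    return weights

def child_weights_shared(y_bar, R, x, layer, omega):
    """Same weights, with the intermediate value M computed once for all children."""
    m = y_bar[layer]
    for k in range(layer + 1, len(x)):
        m -= R[layer, k] * x[k]               # 2N - l subtractions, done once
    # one further subtraction per child, |Omega| in total
    return [(m - R[layer, layer] * symbol) ** 2 for symbol in omega]

rng = np.random.default_rng(3)
R = np.triu(rng.standard_normal((4, 4)))
y_bar = rng.standard_normal(4)
x = np.array([0.0, 0.0, 1.0, -1.0])           # the two upper layers are already decided
omega = (-1.0, 1.0)
naive = child_weights_naive(y_bar, R, x, 1, omega)
shared = child_weights_shared(y_bar, R, x, 1, omega)
```

Both functions return the same weights; only the number of additions differs.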



C. New Lattice Representation 

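In our previous work [9], [10], a new lattice representation is proposed for (8) that enables decoding
the real and imaginary parts of each complex symbol independently. Also, a near-ML decoding algorithm,
which combines DFS, K-best decoding, and quantization, was introduced. In this work, a different
application of the lattice representation, which achieves no performance degradation, is employed.
For the new lattice representation, (4)-(7) become

$$\mathbf{y} = \begin{bmatrix} \Re\{r_1\} & \Im\{r_1\} & \cdots & \Re\{r_N\} & \Im\{r_N\} \end{bmatrix}^T, \tag{15}$$

$$\mathbf{H} = \begin{bmatrix} \Re\{H_{1,1}\} & -\Im\{H_{1,1}\} & \cdots & \Re\{H_{1,N}\} & -\Im\{H_{1,N}\} \\ \Im\{H_{1,1}\} & \Re\{H_{1,1}\} & \cdots & \Im\{H_{1,N}\} & \Re\{H_{1,N}\} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \Re\{H_{N,1}\} & -\Im\{H_{N,1}\} & \cdots & \Re\{H_{N,N}\} & -\Im\{H_{N,N}\} \\ \Im\{H_{N,1}\} & \Re\{H_{N,1}\} & \cdots & \Im\{H_{N,N}\} & \Re\{H_{N,N}\} \end{bmatrix}, \tag{16}$$

$$\mathbf{x} = \begin{bmatrix} \Re\{s_1\} & \Im\{s_1\} & \cdots & \Re\{s_N\} & \Im\{s_N\} \end{bmatrix}^T, \tag{17}$$

$$\mathbf{n} = \begin{bmatrix} \Re\{v_1\} & \Im\{v_1\} & \cdots & \Re\{v_N\} & \Im\{v_N\} \end{bmatrix}^T. \tag{18}$$

The structure of the new lattice representation (15)-(18) becomes advantageous after applying the QR
decomposition to $\mathbf{H}$. By doing so, and due to the special form of orthogonality between each pair of
columns, all elements $R_{k,k+1}$ for $k = 1, 3, \ldots, 2N - 1$ in the upper triangular matrix $\mathbf{R}$ become zero. The
locations of these zeros introduce orthogonality between the real and imaginary parts of every detected
symbol, which can be taken advantage of to reduce the computational complexity of SD. The following
example is provided to explain this.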

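Example: Consider a MIMO system having $N_r = N_t = N = 2$ and employing 4-QAM. Then, SD
constructs a tree with $2N = 4$ levels, where the branches coming out from each node represent the real
values in the set $\Omega = \{-1, 1\}$. This tree is shown in Fig. 1. Now using the real-valued lattice representation
(15)-(18), and applying the QR decomposition to the channel matrix, the input-output relation is given by

$$\begin{bmatrix} \bar{y}_1 \\ \bar{y}_2 \\ \bar{y}_3 \\ \bar{y}_4 \end{bmatrix} = \begin{bmatrix} R_{1,1} & 0 & R_{1,3} & R_{1,4} \\ 0 & R_{2,2} & R_{2,3} & R_{2,4} \\ 0 & 0 & R_{3,3} & 0 \\ 0 & 0 & 0 & R_{4,4} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} + \begin{bmatrix} \bar{n}_1 \\ \bar{n}_2 \\ \bar{n}_3 \\ \bar{n}_4 \end{bmatrix}. \tag{19}$$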



Based on (11) and (19), the partial node weights of the first level and of the second level can be calculated
independently of each other, and likewise for the third level and the fourth level, because of the additional zero
locations in the $\mathbf{R}$ matrix. For instance, the partial weights of node A and node B depend only on $x_3$ and not on $x_4$,
and the partial weights of node C, node D, node E, and node F depend on $x_4$, $x_3$, and $x_1$, but not on $x_2$. In other
words, the partial weights of node A and node B are equal, and only need to be calculated once. Similarly, the partial
weights of node C and node D can be reused when calculating the partial weights of node E and node F,
respectively.
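
The zero pattern that this example relies on can be checked numerically. The sketch below assumes the interleaved ordering in which the real and imaginary parts of each complex symbol occupy adjacent positions, so that each complex channel coefficient becomes a 2x2 block; the helper name `interleaved_real_channel` and the random test data are hypothetical.

```python
import numpy as np

def interleaved_real_channel(H):
    """Real-valued channel in which each complex entry becomes a 2x2 block."""
    n_r, n_t = H.shape
    H_real = np.zeros((2 * n_r, 2 * n_t))
    H_real[0::2, 0::2] = H.real
    H_real[0::2, 1::2] = -H.imag
    H_real[1::2, 0::2] = H.imag
    H_real[1::2, 1::2] = H.real
    return H_real

rng = np.random.default_rng(4)
N = 2
H = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2)
Q, R = np.linalg.qr(interleaved_real_channel(H))

# Entries R[k, k+1] for k = 1, 3, ..., 2N - 1 in 1-based indexing,
# i.e. R[0, 1] and R[2, 3] with 0-based indexing for N = 2
paired_entries = [R[k, k + 1] for k in range(0, 2 * N, 2)]

# Layer-3 partial weight |y_3 - R_33 x_3 - R_34 x_4|^2 for both values of x_4
y_bar = rng.standard_normal(2 * N)
x3 = 1.0
weights_layer3 = [(y_bar[2] - R[2, 2] * x3 - R[2, 3] * x4) ** 2 for x4 in (-1.0, 1.0)]
```

Up to floating-point rounding, the paired entries vanish and the two layer-3 weights coincide, which is the property that allows the weights of node A and node B to be computed once.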

SD is then modified to exploit this feature. Once the tree is searched in layer $l$, where $l$ is an odd
number, the partial weights of this node and all of its sibling nodes are computed, temporarily stored, and
reused when calculating partial weights of nodes that have the same grandparent node of layer $l + 2$ but
different parent nodes of layer $l + 1$.

By applying the modification, further complexity reduction is achieved beyond the reduction due to the
check-table $T$ and the intermediate $M(\mathbf{x}^{(l+1)})$ values. For a node of layer $l + 2$, where $l$ is an odd number,
let $\alpha \in [0, |\Omega|]$ denote the number of non-pruned branches for its children nodes of layer $l + 1$. If $\alpha = 0$,
which means all branches of its children nodes of layer $l + 1$ are pruned, the number of operations needed
stays the same. If $\alpha \neq 0$, to get all partial weights of its grandchildren nodes in layer $l$, the numbers of real
multiplications and real additions reduce further from $(\alpha + 1)|\Omega|$ to $2|\Omega|$, and from
$(\alpha + 1)(2N - l - 1 + |\Omega|) + \alpha$ to $2(2N - l - 1 + |\Omega|)$, respectively.
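
A small numerical instance helps to gauge these counts. The Python sketch below evaluates the expressions above, reading the starting multiplication count as $(\alpha + 1)|\Omega|$, which is what one squaring per visited node implies once the check-table is in use. The function names and the chosen parameters are assumptions for illustration.

```python
def ops_before(alpha, n, layer, omega_size):
    """(multiplications, additions) with the check-table and stored M values only."""
    base = 2 * n - layer - 1 + omega_size
    return (alpha + 1) * omega_size, (alpha + 1) * base + alpha

def ops_after(alpha, n, layer, omega_size):
    """Counts when layer-l partial weights are shared across parent nodes."""
    if alpha == 0:
        return ops_before(alpha, n, layer, omega_size)   # nothing to share
    base = 2 * n - layer - 1 + omega_size
    return 2 * omega_size, 2 * base

# 4x4 system (N = 4), 64-QAM (|Omega| = 8), layer l = 1, all 8 branches kept
before = ops_before(alpha=8, n=4, layer=1, omega_size=8)
after = ops_after(alpha=8, n=4, layer=1, omega_size=8)
```

In this instance the counts drop from 72 multiplications and 134 additions to 16 and 28. For $\alpha = 1$ the multiplication count is unchanged and a single addition is saved, so the benefit grows with the number of surviving branches.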

IV. SIMULATION RESULTS 

To verify the proposed technique, simulations are carried out for 2x2 and 4x4 systems using 4-QAM 
and 64-QAM. Assuming 4000 symbols are transmitted during one channel realization and considering 
1000 channel realizations for each simulation, the average number of real multiplications and real additions 
for decoding one transmitted vector symbol are calculated. In the figures, conventional SD is denoted by 
CSD and proposed SD by PSD. In our simulations, $\delta^2 = 2\sigma^2 N$ is chosen as the square of the initial radius.
A lattice point lies inside a sphere of this radius with high probability [7].
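
How high this probability is can be estimated with a short calculation. The sketch below rests on an assumption about the noise model that should be checked against the simulation setup: each complex noise entry has variance $\sigma^2$, so each of the $2N$ real noise components has variance $\sigma^2/2$, and the transmitted lattice point lies inside the sphere exactly when the squared noise norm is below $2\sigma^2 N$. The function names are hypothetical.

```python
import math
import numpy as np

def prob_inside(n):
    """P(||noise||^2 < 2 sigma^2 N) under the per-component variance sigma^2 / 2 assumption."""
    # ||noise||^2 / (sigma^2 / 2) is chi-square with 2N degrees of freedom, and the
    # threshold becomes 4N. For even degrees of freedom the CDF has a closed form.
    half = 2.0 * n
    return 1.0 - math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(n))

def prob_inside_mc(n, sigma_sq=1.0, trials=200_000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((trials, 2 * n)) * math.sqrt(sigma_sq / 2.0)
    return float(np.mean(np.sum(noise ** 2, axis=1) < 2.0 * sigma_sq * n))

p2, p4 = prob_inside(2), prob_inside(4)
```

Under this assumption the probability is about 0.91 for $N = 2$ and about 0.96 for $N = 4$, independent of $\sigma^2$, so a restart with a larger radius is needed only occasionally.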

Fig. 2 and Fig. 3 show comparisons for the number of operations between conventional SD and proposed
SD for 2x2 systems using 4-QAM and 64-QAM. For 4-QAM, the complexity gains for the average
numbers of real multiplications and real additions are around 70% and 45% respectively at high SNR.
Corresponding numbers are 75% and 40% respectively at low SNR. For the 64-QAM case, gains increase
to around 70% and 65% at high SNR respectively, while they are around 85% and 60% at low SNR
respectively.

Similarly, Fig. 4 and Fig. 5 show complexity comparisons using 4-QAM and 64-QAM for 4x4 systems.
For 4-QAM, gains for the average number of real multiplications and real additions are around 80% and
50% respectively at high SNR, while they are around 85% and 45% respectively at low SNR. For
64-QAM, gains rise to around 80% and 75% respectively at high SNR, while they are around 90% and
70% respectively at low SNR.

Simulation results show that the proposed SD reduces the complexity significantly compared to conventional
SD, particularly for real multiplications, which are the most expensive operations in terms of machine
cycles, and the reduction becomes larger as the system dimension or the modulation alphabet size increases.
An important property of the proposed SD is that the substantial complexity reduction achieved causes no
performance degradation. The proposed technique can be combined with other techniques that reduce
the number of visited nodes, such as SE, and with other near-optimal techniques such as K-best.

V. CONCLUSIONS 

A simple and general technique to implement the SD algorithm with low computational complexity 
is proposed in this paper. The focus of the technique is on reducing the average number of operations 
required at each node for SD. The BER performance of the proposed SD is the same as conventional SD. 
Moreover, a substantial complexity reduction is achieved. Simulation results are provided for 2x2 and 
4x4 systems employing 4-QAM and 64-QAM. The complexity gains for the average numbers of real 
multiplications and real additions are substantial, ranging from 70% to 90% and 40% to 75% respectively, 
based on the number of antennas and the constellation size of modulation schemes. These complexity 
gains become larger as the system dimension or the modulation alphabet size increases. 

Fig. 1. Tree structure for a 2x2 system employing 4-QAM.

Fig. 2. Average number of real multiplications vs. SNR for conventional SD and proposed SD over a 2x2 MIMO flat fading channel
using 4-QAM and 64-QAM.

Fig. 3. Average number of real additions vs. SNR for conventional SD and proposed SD over a 2x2 MIMO flat fading channel.

Fig. 4. Average number of real multiplications vs. SNR for conventional SD and proposed SD over a 4x4 MIMO flat fading channel.

Fig. 5. Average number of real additions vs. SNR for conventional SD and proposed SD over a 4x4 MIMO flat fading channel.



